Red Wine Quality Exploratory Analysis by Chris Eldredge

This data set contains 1,599 red wines with 11 variables for the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).

Univariate Plots Section

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Fixed acidity seems somewhat normally distributed, with a slight right skew. There are a few outliers. Mean value is 8.32, and median is 7.90.

Volatile acidity also seems somewhat normally distributed, with a few outliers. The mean is 0.5278, and median 0.5200.

Wines showed a wide range of citric acid levels. There are a few outliers. The mean value is 0.271, and the median is 0.260.

The chart suggests that most wines cluster around a residual sugar level of about 2 g / dm^3. However, there were outliers with much higher residual sugar content; the maximum value was 15.5. The median residual sugar is 2.2, but the mean, 2.54, is higher because of the outliers.
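To illustrate how a single large value pulls the mean above the median, here is a minimal sketch using only the first ten residual.sugar values printed in the structure output above. The variable names are made up for this example, and ten rows are not representative of the full dataset.

```r
# First ten residual.sugar values, copied from the str() output above
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

sugar_mean   <- mean(residual_sugar)    # pulled upward by the 6.1 value
sugar_median <- median(residual_sugar)  # barely affected by one large value

# Drop the largest value and recompute the mean to see its influence
mean_without_max <- mean(residual_sugar[residual_sugar < max(residual_sugar)])

print(c(mean = sugar_mean,
        median = sugar_median,
        mean_without_max = mean_without_max))
```

The same comparison on the full data frame would show the gap between the median and the mean reported in the summary output.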

The histogram shows most wines in the dataset have chlorides in a narrow range, with the most frequent value being slightly less than 0.1 g / dm^3. However, there are some outliers with higher values. The median value is 0.079, but the mean, 0.087, is slightly higher because it is influenced by the outliers.

The distribution of wines by free.sulfur.dioxide appears right skewed. There are a few outliers with higher values. The mean is 15.87, and median 14.00.

The distribution of wines by total.sulfur.dioxide appears right skewed, similar to free.sulfur.dioxide. There are a few outliers. The mean value is 46.47, the median is 38.00.

The distribution for density appears normal and unimodal. There are relatively few outliers. The mean value, 0.9967, and median value, 0.9968, are nearly identical.

The distribution for pH also appears normal and unimodal. There are a few outliers on both sides of the distribution. The mean value is 3.311, and the median value is 3.310.

The distribution of sulphates appears unimodal and somewhat normal. There are several outliers toward the right tail. The mean value is 0.6581, and the median is 0.6200.

Most wines appear to have an alcohol content of at least 9 percent of volume, with higher alcohol content less common. There are a few outliers. The mean value is 10.42, and the median value is 10.20.

Although the quality score ranges between 0 and 10, it appears that the dataset only contains wines with scores between 3 and 8. The overwhelming majority of wines were in the mid/high-mid range, 5, 6 and some 7.
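One way to check which parts of the rating scale are actually used is to tabulate quality against all eleven possible scores. This is a sketch on the first ten quality values from the structure output above, so the counts differ from the full dataset; the variable names are assumptions for this example.

```r
# First ten quality scores, copied from the str() output above
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Declaring every possible score (0 to 10) as a level keeps unused
# scores visible in the table as zero counts
quality_counts <- table(factor(quality, levels = 0:10))
print(quality_counts)

# Scores that occur at least once in this sample
used_scores <- names(quality_counts)[as.vector(quality_counts) > 0]
print(used_scores)
```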

Univariate Analysis

What is the structure of your dataset?

The dataset contains 1,599 red wines along with 11 attributes that describe the chemical properties of each wine, and an output variable with an expert rating indicating quality on a scale of 0 (very bad) to 10 (very excellent). At least 3 experts rated each wine.

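The dataset includes the attributes below, recorded in the respective units. When the original dataset was loaded into R, the quality variable was of type integer, and the chemical attributes were of type numeric. The loaded data frame also includes an integer row index, X, which is why the structure output above lists 13 variables.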

1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
Output variable (based on sensory data):
12 - quality (score between 0 and 10)

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is how the quality rating changes as the chemical properties vary. Which chemical properties appear to influence the quality of red wines? For example, did wines with a higher alcohol content (% by volume) receive higher quality ratings?

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

It may make sense to analyze potentially related variables. For example, perhaps the dataset shows similar patterns between pH, fixed.acidity, and volatile.acidity, because all of these variables indicate the acidity of the wine in some form. Similarly, it may be worthwhile to compare free.sulfur.dioxide and total.sulfur.dioxide.

Did you create any new variables from existing variables in the dataset?

No, it did not seem necessary to create any new variables.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Yes, I changed the quality variable to a factor type to ensure that each level of quality would be plotted in an easy-to-read way for the box plot chart.
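Because the correlation tests in the next section call as.numeric(rw$quality), it may help to see what that returns once quality is a factor. The sketch below uses the first ten alcohol and quality values from the structure output above, with variable names chosen for this example. as.numeric() on a factor returns the internal level codes (1, 2, 3, ...) rather than the original scores. When the levels are consecutive integers, the codes differ from the scores only by a constant, so Pearson's correlation is unaffected.

```r
# First ten values, copied from the str() output above
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

quality_factor <- factor(quality)

# as.numeric() on a factor gives level codes, not the original scores
codes  <- as.numeric(quality_factor)
# Going through as.character() recovers the original scores
scores <- as.numeric(as.character(quality_factor))

print(rbind(codes = codes, scores = scores))

# Levels here are 5, 6, 7 (consecutive), so codes = scores - 4 and the
# correlation with alcohol is the same either way
print(c(r_codes = cor(alcohol, codes), r_scores = cor(alcohol, scores)))
```

If a score were missing from the middle of the range, the codes would no longer be a simple shift of the scores, and converting through as.character() would be the safer choice.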

Bivariate Plots Section

## 
##  Pearson's product-moment correlation
## 
## data:  rw$alcohol and as.numeric(rw$quality)
## t = 21.6395, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663

The scatter plot of quality by alcohol content shows that most wines rated 5 have an alcohol content below 10 percent. There seems to be more variation in alcohol content among wines rated 6 or higher. The correlation is moderately positive, with Pearson's r of about 0.48.
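The output above names its inputs as rw$alcohol and as.numeric(rw$quality), so it was presumably produced by a call to cor.test() on those two vectors. The following is a minimal sketch of such a call, run on only the first ten rows from the structure output, so its estimate will not match the full-data value. It also squares the reported full-data estimate: cor.test() reports Pearson's r, and r squared is a separate, smaller quantity.

```r
# First ten values, copied from the str() output above
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Pearson's product-moment correlation test on the ten-row sample
res <- cor.test(alcohol, quality, method = "pearson")
r_sample <- unname(res$estimate)
print(r_sample)
print(res$parameter)  # degrees of freedom, n - 2

# Full-data estimate as reported in the output above
r_full <- 0.4761663
print(r_full^2)  # share of variance in quality linearly associated with alcohol
```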

## 
##  Pearson's product-moment correlation
## 
## data:  rw$volatile.acidity and as.numeric(rw$quality)
## t = -16.9542, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578

This scatter plot of volatile acidity appears to show a slight decreasing trend, with wines having less volatile acidity as the rating increases. This decreasing trend is consistent with the negative correlation, r = -0.39.

A boxplot of the same volatile acidity data makes the decreasing trend more apparent. Median volatile acidity decreases as quality increases, and seems to level off at a quality score of 7.
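A boxplot grouped by quality summarizes each rating by its median and quartiles. As a sketch, using the first ten rows from the structure output and names chosen for this example, the per-quality medians can be computed with tapply() and the plot drawn with base R's boxplot(). The original chart may have been drawn with a different plotting library.

```r
# First ten values, copied from the str() output above
volatile_acidity <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)
quality <- factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5))

# Median volatile acidity within each quality level
median_by_quality <- tapply(volatile_acidity, quality, median)
print(median_by_quality)

# One box per quality level
boxplot(volatile_acidity ~ quality,
        xlab = "quality",
        ylab = "volatile acidity (g / dm^3)")
```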

## 
##  Pearson's product-moment correlation
## 
## data:  rw$fixed.acidity and as.numeric(rw$quality)
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07548957 0.17202667
## sample estimates:
##       cor 
## 0.1240516
## 
##  Pearson's product-moment correlation
## 
## data:  rw$volatile.acidity and as.numeric(rw$quality)
## t = -16.9542, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578
## 
##  Pearson's product-moment correlation
## 
## data:  rw$citric.acid and as.numeric(rw$quality)
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1793415 0.2723711
## sample estimates:
##       cor 
## 0.2263725

I plotted all “acid” related variables to see if there are any similarities in distribution. At a glance, the plots for volatile and fixed acidity seem similar, with most wines scored 5 or 6 in quality clustering within a narrow range of acidity. However, the plot for citric acid does not show the same pattern.

Based on the correlation coefficients, fixed.acidity is slightly positively correlated with quality (r = 0.12), citric acid is also slightly positively correlated (r = 0.23), and as seen before, volatile acidity is negatively correlated (r = -0.39).
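The three acidity correlations can be computed in one pass rather than with three separate calls. This sketch uses a ten-row data frame built from the structure output (named rw10 here, an assumption for this example), so its values will not match the full-data estimates. It prints both r and r squared. A squared correlation can never be negative, so signed values such as -0.39 are values of r itself.

```r
# First ten rows of the relevant columns, copied from the str() output above
rw10 <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

acid_vars <- c("fixed.acidity", "volatile.acidity", "citric.acid")

# Pearson's r between each acidity variable and quality
r_by_variable <- sapply(acid_vars, function(v) cor(rw10[[v]], rw10$quality))

print(round(r_by_variable, 2))    # signed: direction and strength
print(round(r_by_variable^2, 2))  # r squared: always between 0 and 1
```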

The boxplots seem easier to read. The same variables plotted as boxplots show that median volatile acidity decreases slightly as quality increases, whereas median fixed acidity increases slightly as quality increases. Citric acid shows much wider variance, and the median citric acid level (g / dm^3) generally increases as the quality score increases.

## 
##  Pearson's product-moment correlation
## 
## data:  rw$residual.sugar and as.numeric(rw$quality)
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03531327  0.06271056
## sample estimates:
##        cor 
## 0.01373164
## 
##  Pearson's product-moment correlation
## 
## data:  rw$chlorides and as.numeric(rw$quality)
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.17681041 -0.08039344
## sample estimates:
##        cor 
## -0.1289066

The residual sugar and chlorides in wine both appear to be relatively consistent across quality scores. The correlation coefficient for residual sugar shows nearly no relationship to quality (r = 0.014), and the coefficient for chlorides shows a slight negative relationship (r = -0.13).

## 
##  Pearson's product-moment correlation
## 
## data:  rw$free.sulfur.dioxide and as.numeric(rw$quality)
## t = -2.0269, df = 1597, p-value = 0.04283
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.099430290 -0.001638987
## sample estimates:
##         cor 
## -0.05065606
## 
##  Pearson's product-moment correlation
## 
## data:  rw$total.sulfur.dioxide and as.numeric(rw$quality)
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2320162 -0.1373252
## sample estimates:
##        cor 
## -0.1851003

Both free and total sulfur dioxide levels have a fair amount of variance across quality scores. The boxplots suggest there is not a strong correlation with quality, and the correlation coefficients confirm this: free sulfur dioxide shows nearly no correlation with quality (r = -0.05), and total sulfur dioxide shows a slight negative correlation (r = -0.19).

## 
##  Pearson's product-moment correlation
## 
## data:  rw$density and as.numeric(rw$quality)
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2220365 -0.1269870
## sample estimates:
##        cor 
## -0.1749192
## 
##  Pearson's product-moment correlation
## 
## data:  rw$pH and as.numeric(rw$quality)
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.106451268 -0.008734972
## sample estimates:
##         cor 
## -0.05773139

Density seems relatively consistent across quality scores, with median density decreasing slightly at the highest quality scores. Likewise, pH shows a similar pattern, consistent across most scores, a slight decrease in median pH at the highest quality scores, and higher variance.

The slight decreasing trend is consistent with the correlations with quality score. Density shows a slightly negative correlation (r = -0.17). pH shows a slightly negative correlation, too (r = -0.058).

## 
##  Pearson's product-moment correlation
## 
## data:  rw$sulphates and as.numeric(rw$quality)
## t = 10.3798, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049011 0.2967610
## sample estimates:
##       cor 
## 0.2513971
## 
##  Pearson's product-moment correlation
## 
## data:  rw$alcohol and as.numeric(rw$quality)
## t = 21.6395, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663

Median sulphates seemed to increase as quality increased. This trend is consistent with the slightly positive correlation coefficient, r = 0.25. There were a fair number of outliers for quality scores 5 and 6.

Median percent alcohol content increased as the quality of the wine increased. This appears to be one of the strongest relationships of any of the attributes in the dataset. The moderately positive correlation coefficient of 0.48 confirms this trend.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

I found it interesting that the plots suggest that one of the strongest relationships is between alcohol content and wine rating. As alcohol content increased, so did wine rating. Additionally, I found the opposite relationships for fixed acidity (increased as rating increased) and volatile acidity (decreased as rating increased) interesting, because both measure acid in some form. However, the notes explain that high levels of volatile acidity cause an unpleasant, vinegar taste, so this relationship makes sense.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

What was the strongest relationship you found?

The strongest relationship appears to be between wine rating and alcohol content. As alcohol content increased, so did the rating. The Pearson correlation coefficient shows a moderately positive relationship (Pearson’s product-moment correlation r = 0.4761663).

Multivariate Plots Section

This scatter plot of fixed acidity and volatile acidity shows most wines in the data set clustering within a similar overall range, and higher-rated wines tending to have lower values, particularly for volatile acidity, which is associated with an unpleasant vinegar taste.

This scatter plot suggests that the highest rated wines tend to have relatively lower volatile acidity and relatively higher alcohol content.

However, plotting the smoothed conditional mean removes the overplotting and highlights the trends more clearly. The smoothed plot highlights that higher quality wines tend to have both higher alcohol content and lower volatile acidity, although there is some variation.
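A simpler stand-in for the smoothed plot is to compute plain group means per quality score. The sketch below does this with aggregate() on the first ten rows from the structure output (a data frame named rw10 for this example). It is not the smoother used for the chart, and with only ten rows the means are illustrative only.

```r
# First ten rows of the relevant columns, copied from the str() output above
rw10 <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Mean alcohol and mean volatile acidity within each quality score
means_by_quality <- aggregate(cbind(alcohol, volatile.acidity) ~ quality,
                              data = rw10, FUN = mean)
print(means_by_quality)
```

Reading down the rows shows how both means move as quality increases, which is the pattern the smoothed plot is meant to highlight.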

Although there don’t appear to be any extremely strong relationships, as the bivariate analysis showed, the highest-rated wines seem to have higher citric acid.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

I chose to focus on attributes that stood out in the bivariate analysis. I found it interesting that, when volatile acidity and alcohol are plotted together, the plot strongly suggests that the highest rated wines tend to have both relatively high alcohol and low volatile acidity.

Were there any interesting or surprising interactions between features?

I did not expect the smoothed plot for volatile acidity and alcohol content to be helpful, but it turned out to be useful because it draws attention to both the variation and the means for alcohol content and volatile acidity.


Final Plots and Summary

Plot One

Description One

This histogram gives a sense of the number of red wines in the dataset, and highlights that only part of the ten point rating scale was used. By far the most frequent ratings are 5 and 6.

Plot Two

Description Two

This boxplot shows that higher rated wines tend to have lower amounts of volatile acidity. This relationship is supported by a moderately negative correlation coefficient of -0.39 between volatile acidity and quality.

Plot Three

Description Three

This smoothed conditional mean of quality rating by alcohol and volatile acidity shows that the highest rated wines tend to have both the highest alcohol content and the lowest volatile acidity. The correlation coefficients with quality, approximately 0.48 for alcohol and -0.39 for volatile acidity, are consistent with this trend.


Reflection

Working through the exploratory data analysis project, I occasionally ran into difficulties with the syntax and data types needed to create the plots I intended. However, through researching and reviewing the course notes, I was usually able to find a way to put together an appropriate plot.

When I first started working with the dataset I read through the variable descriptions and developed a few ideas about what might influence quality, but it wasn’t until I started making an initial set of univariate and bivariate plots that I thought of additional ideas and relationships to explore.

This analysis provides a good high level introduction to the dataset and possible relationships. Next steps for the analysis could include building a regression model to predict wine quality based on input of various chemical attributes.
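As a rough sketch of that next step, a linear model could regress quality on the two attributes that stood out here, alcohol and volatile acidity. The example below fits lm() to only the first ten rows from the structure output (a data frame named rw10 for this example), so the coefficients are illustrative and not results for the full dataset. It also treats the ordinal quality score as a continuous number, which is a simplifying assumption.

```r
# First ten rows of the relevant columns, copied from the str() output above
rw10 <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Ordinary least squares with two predictors
fit <- lm(quality ~ alcohol + volatile.acidity, data = rw10)
print(coef(fit))

# Predicted quality for a hypothetical wine (values chosen for illustration)
new_wine <- data.frame(alcohol = 10, volatile.acidity = 0.6)
predicted_quality <- predict(fit, newdata = new_wine)
print(predicted_quality)
```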